Consensus Graph Representation Learning for Better Grounded Image Captioning

Authors

Abstract

The contemporary visual captioning models frequently hallucinate objects that are not actually in a scene, due to visual misclassification or over-reliance on priors, resulting in semantic inconsistency between the visual information and the target lexical words. The most common remedy is to encourage the model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC relies on an auxiliary task (grounding objects) that has not solved the key issue behind hallucination, the semantic inconsistency. In this paper, we take a novel perspective on the problem above: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose a Consensus Graph Representation Learning framework (CGRL) that incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., a scene graph) with the language graph, considering both the nodes and the edges of the graphs. With the aligned consensus, the captioning model can capture both correct linguistic characteristics and visual relevance, and then ground appropriate image regions. We validate the effectiveness of our model, observing a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. Besides, CGRL is also evaluated with several automatic metrics and by human evaluation; the results indicate that the proposed approach can simultaneously improve captioning performance (+2.9 CIDEr) and grounding (+2.3 F1LOC).
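The abstract describes the core idea only at a high level: learn a consensus representation by aligning the nodes and edges of a visual scene graph with those of a language graph, then feed that consensus to the grounded captioning decoder. As a rough, hypothetical sketch of what such an alignment objective could look like (this is not the authors' implementation; the cosine alignment loss, the averaging fusion, and all tensor shapes below are assumptions for illustration), consider:

import torch
import torch.nn.functional as F

def alignment_loss(visual_emb, language_emb):
    # Encourage matched graph elements (rows) to agree across the two modalities:
    # 1 - cosine similarity, averaged over nodes or edges.
    v = F.normalize(visual_emb, dim=-1)
    l = F.normalize(language_emb, dim=-1)
    return (1.0 - (v * l).sum(dim=-1)).mean()

def consensus(visual_emb, language_emb):
    # Fuse the two aligned views into a single consensus representation
    # (simple averaging here, purely illustrative).
    return 0.5 * (visual_emb + language_emb)

# Toy example: 5 matched nodes and 7 matched edges, 256-d features per modality.
vis_nodes, lang_nodes = torch.randn(5, 256), torch.randn(5, 256)
vis_edges, lang_edges = torch.randn(7, 256), torch.randn(7, 256)

loss = alignment_loss(vis_nodes, lang_nodes) + alignment_loss(vis_edges, lang_edges)
consensus_nodes = consensus(vis_nodes, lang_nodes)  # would be fed to the captioning decoder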

Related articles

Contrastive Learning for Image Captioning

Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

The existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image captioning models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders each of which...

Learning to Guide Decoding for Image Captioning

Recently, much progress has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called the guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at ...

Learning to Evaluate Image Captioning

Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlate...

A Distributed Representation Based Query Expansion Approach for Image Captioning

In Figure 1, we present more example results obtained with our approach on the benchmark datasets Flickr8K (Hodosh et al., 2013), Flickr30K (Young et al., 2014), and MS COCO (Lin et al., 2014). We also provide ground-truth human descriptions for comparison. There are some cases where our approach falls short. In some of those cases, although the system does not produce the most desirable results, it...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i4.16452